HitCount Lifecycle: experimental CRAN status Codecov test coverage Downloads

Introduction

The volcano3D package enables exploration of probes differentially expressed between three groups. Its main purpose is for the visualisation of differentially expressed genes in a three-dimensional volcano plot. These plots can be converted to interactive visualisations using plotly.

This vignette will consist of a number of case studies focusing on the PEAC rheumatoid arthritis study (Pathobiology of Early Arthritis Cohort). The methodology has been published in ‘Lewis, Myles J., et al. “Molecular portraits of early rheumatoid arthritis identify clinical and treatment response phenotypes.” Cell reports 28.9 (2019): 2455-2470.’ (DOI: 10.1016/j.celrep.2019.07.091) with an interactive web tool available at https://peac.hpc.qmul.ac.uk.

Getting Started

Prerequisites

Install from CRAN

Not yet publicly available:

install.packages("volcano3D")

Install from Github

library(devtools)
install_github("KatrionaGoldmann/volcano3D")
library(volcano3D)

Sample Data

The sample data can then also be installed (this can only be done after volcano3D is imported)

install.packages("volcano3Ddata")

volcano3D data

The PEAC data for this package is automatically loaded from the volcano3Ddata package repo (source).


Dictionary

Variables used in this vignette:

Variable Definition
contrast the variable by which samples can be split into three groups.
groups the three levels/categories of the contrast variable.
comparison two groups between which a statistical test can be performed. There are three comparisons total. For the examples outlined in this vignette we look at comparisons: ‘lymphoid-myeloid’, ‘lymphoid-fibroid’ and ‘myeloid-fibroid’.
p p value
FC fold change
padj adjusted p value
suffix the tail word in a column name. In this package it states the statistical parameter (e.g. _logFC is the log FC variable)
prefix the leading word in a column name. In this package it states the statistical test (e.g. LRT is the likelihood ratio test).
dep Differential Expression P-values’ object, of S4 class, containing the expression data, sample data and pvalues.

Examples

This vignette will demonstrate the power of this package using a basic example from the PEAC data set. This analysis is separated by biopsy/tissue type: blood and synovium. Here we will focus on the synovial data. This is stored in the Git repo volcano3Ddata.

1. Synovial Gene Data

First we will set up a differential expression pvalues object (dep) containing:

  1. the expression data with columns representing different samples and rows representing different variables
  2. the sample data which contains information for each sample in rows.
  3. the pvalues which contains the statistical significance of probes between groups.

Creating dep objects

To create a dep object you must provide the ‘create_dep’ function with:

  • sampledata: A data frame containing the sample information. This must contain:
    • an ID column: Containing the sample IDs. This must be titled ‘ID’.
    • a contrast column: A column containing the three-level factor used for contrasts.
  • contrast: The column name in sampledata which contains the three-level factor used for contrast.
  • pvalues: A data frame containing:
    • three pvalue columns: one for each comparison between groups (column names of format paste0(groups[i], "-", groups[j], " ", pColSuffix)). We recommend using limma or DESeq pipelines to calculate these pvalues for gene expression.
    • an optional multi-group pvalue column: from a multi-group test (column names of form paste(multi_group_prefix, p_col_suffix)). This is typically generated using ANOVA or likelihood ratio tests between all three groups.
    • three fold change columns: one for each comparison between groups (column names of form paste0(groups[i], "-", groups[j], " ", fcColSuffix))
    • an optional multi-group fold change column: from a multi-group test (column names of form paste(multi_group_prefix, fc_col_suffix)). This is typically generated using ANOVA or likelihood ratio tests comparing all three groups.
    • three optional adjusted pvalue columns: one for each comparison between groups (column names of form paste0(groups[i], "-", groups[j], " ", padj_col_suffix))
    • an optional multi-group adjusted pvalue column: from a multi-group test (column names of form paste(multiGroupPrefix, padjColSuffix)). This is typically generated using ANOVA or likelihood ratio tests comparing all three groups.

with optional:

  • groups: The groups to be compared (in order). If NULL this defaults to levels(sampledata[, 'contrasts']).
  • p_col_suffix: The suffix of column names with pvalues (default is ‘pvalue’).
  • padj_col_suffix: The suffix of column names with adjusted pvalues (default is ‘padj’). If NULL the adjusted pvalue is calculated using p_col_suffix and pvalue_method.
  • padjust_method: The method to calculate adjusted pvalues if not already provided. Must be one of c(‘holm’, ‘hochberg’, ‘hommel’, ‘bonferroni’, ‘BH’, ‘BY’, ‘fdr’, ‘none’). Default is ‘BH’.
  • fc_col_suffix: The suffix of column names with log(fold change) values (default is ‘logFC’).
  • multi_group_prefix: The prefix for columns containing statistics for a multi-group test (this is typically a likelihood ratio test or ANOVA). Default is NULL.

Using the synovial biopsies from PEAC we can create a dep object for differentially expressed genes. First we will install the volcano3Ddata package if necessary:

# Install the data package if required
if(! "volcano3Ddata" %in% installed.packages()[,"Package"]){
  install.packages("volcano3Ddata")
}

Then the data can be loaded:

library(volcano3Ddata)
data("syn_data") 

syn_p_obj <- create_dep(sampledata = syn_metadata, 
                       contrast = "Pathotype", 
                       pvalues = syn_pvalues,
                       p_col_suffix = "pvalue", 
                       fc_col_suffix = "log2FoldChange",
                       multi_group_prefix = "LRT", 
                       expression = syn_rld)

The pvalues slot should have three statistics for each comparison: pvalue, adjusted pvalue and logarithmic fold change:

head(syn_p_obj@pvalues) %>%
  kable() %>%
  kable_styling(font_size=8.7)
Fibroid-Lymphoid logFC Fibroid-Lymphoid padj Fibroid-Lymphoid pvalue LRT logFC LRT padj LRT pvalue Lymphoid-Myeloid logFC Lymphoid-Myeloid padj Lymphoid-Myeloid pvalue Myeloid-Fibroid logFC Myeloid-Fibroid padj Myeloid-Fibroid pvalue
A2M -0.0478326 1 0.7775361 0.2681094 0.4353436 0.2739571 -0.2202768 1.0000000 0.1563045 0.2681094 1 0.1694654
A2ML1 1.6854024 1 0.0002106 0.2445664 0.0000202 0.0000022 -1.9299687 0.0464724 0.0000030 0.2445664 1 0.6348844
A4GALT 1.0248506 0 0.0000000 -0.4601841 0.0000000 0.0000000 -0.5646665 0.2927492 0.0000192 -0.4601841 1 0.0055222
A4GNT -1.1527859 1 0.0238302 1.3385294 0.1583022 0.0732984 -0.1857435 1.0000000 0.6691773 1.3385294 1 0.0197330
AAAS -0.0245296 1 0.8574281 -0.0303402 0.9711813 0.9075226 0.0548697 1.0000000 0.6608837 -0.0303402 1 0.8469982
AACS 0.1729592 1 0.2815302 0.1002448 0.2732939 0.1472059 -0.2732040 1.0000000 0.0630182 0.1002448 1 0.5874802

Volcano Plots

We can now investigate the comparisons between pathotypes using the volcano_trio function. This creates three ggplot outputs.

syn_plots <- volcano_trio(dep = syn_p_obj,
                          sig_names = c("not significant","significant",
                                        "not significant","significant"),
                          colours = rep(c("grey60",  "salmon"), 2),
                          text_size = 9,
                          shared_legend_size = 0.9, 
                          label_rows = c("SLAMF6", "PARP16", "ITM2C"),
                          fc_line = FALSE)

syn_plots$All

Polar Plots

Next the coordinates and colours to map each variable, or row, in the expression/pvalues data onto a polar coordinate system is calculated using the polar_coords function:

syn_polar <- polar_coords(dep = syn_p_obj)

The sig column in syn_polar@polar allows us to determine relative differences in expression between pathotypes. The ‘+’ indicates which pathotypes are significantly ‘up’ compared to others. For example:

  • genes labelled ‘Lymphoid+’ are significantly up in Lymphoid vs Myeloid and Lymphoid vs Fibroid. Genes such as these which are up in only one pathotype are coloured according to the primary_colours variable in polar_coords.
  • genes up in two pathotypes such as ‘Lymphoid+Myeloid+’ are up in both Lymphoid and Myeloid, therefore Lymphoid vs Fibroid and Myeloid vs Fibroid are statistically significant. These genes are coloured according to the secondary_colours.
  • genes which show now significant difference between pathotypes (The ‘Not Significant’ genes) are coloured not_sig_colour.
table(syn_polar@polar$sig) %>%
  kable(col.names = c("Significance", "Frequency")) %>%
  kable_styling(full_width = F)
Significance Frequency
Fibroid 885
Fibroid+Lymphoid+ 19
Fibroid+Myeloid+ 1504
Lymphoid 1793
Lymphoid+Myeloid+ 500
Myeloid 119
Not Significant 11415

These can be output to an interactive radar plot using radial_plotly. The labelRows variable allows any markers of interest to be labelled.

radial_plotly(polar = syn_polar, 
              fc_cutoff = 0.1, 
              label_rows = c("SLAMF6", "PARP16", "ITM2C"))

By hovering over certain point you can also determine genes for future interrogation.

Similarly we can create a static ggplot image using radial_ggplot:

radial_ggplot(polar = syn_polar, 
              fc_cutoff = 0.1, 
              label_rows = c("SLAMF6", "PARP16", "ITM2C"),
              marker_size = 2.3, 
              legend_size = 10) 

Boxplots

We can then interrogate any one specific variable as a boxplot, to investigate these differences. This is build using ggplot2 so can easily be edited by the user to add features.

plot1 <- boxplot_trio(syn_p_obj, 
                     value = "SLAMF6", 
                     test = "wilcox.test", 
                     levels_order = c("Lymphoid", "Myeloid", "Fibroid"), 
                     box_colours = c("blue", "red", "green3"))

plot2 <- boxplot_trio(syn_p_obj, 
                     value = "PARP16", 
                     test = "wilcox.test", 
                     levels_order = c("Myeloid", "Fibroid"), 
                     box_colours = c("red", "green3"))

plot3 <- boxplot_trio(syn_p_obj, 
                     value = "ITM2C", 
                     test = "wilcox.test", 
                     levels_order = c("Lymphoid", "Myeloid", "Fibroid"), 
                     my_comparisons = list(c("Lymphoid", "Myeloid"), 
                                           c("Myeloid", "Fibroid")),
                     box_colours = c("blue", "red", "green3"))

require(ggpubr)
ggarrange(plot1, plot2, plot3, ncol = 3)

Three Dimensional Volcano Plots

The final thing we can look at is the 3D volcano plot which projects differential gene expression onto cylindrical coordinates.

library(plotly)
p <- volcano3D(syn_polar, 
              label_rows = c("SLAMF6", "PARP16", "ITM2C"), 
              label_size = 10, 
              xy_aspectratio = 1, 
              z_aspectratio = 0.9)

p %>% layout(legend = list(x = 100, y = 0.5))

Again this produces an interactive plot. If you have the orca command-line utility installed, this can be used to save static images. To install follow the instructions here.

orca(p, "./volcano_3d_synovium.svg", format = "svg")

2. Synovial Modular Data

We can then collapse this into a modular analysis using a list of gene sets. In this example we have used the blood transcript modules curated by Li et. al. in ‘Li, S., Rouphael, N., Duraisingham, S., Romero-Steiner, S., Presnell, S., Davis, C., … & Kasturi, S. (2014). Molecular signatures of antibody responses derived from a systems biology study of five human vaccines. Nature immunology, 15(2), 195.’. The pvalues were generated using QuSAGE methodology.

Creating dep object

syn_mod_p_obj <- create_dep(sampledata = syn_metadata,
                           contrast = "Pathotype",
                           pvalues = syn_mod_pvalues,
                           p_col_suffix = "p.value",
                           padj_col_suffix = "q.value",
                           fc_col_suffix = "logFC",
                           multi_group_prefix = NULL,
                           expression = syn_mod)

Volcano Plots

We can now investigate the comparisons between pathotypes using the volcano_trio function.

syn_mod_plots <- volcano_trio(dep = syn_mod_p_obj,
                              label_rows = c("M156.0", "M37.2"), 
                              shared_legend_size = 1, 
                              sig_names = c("Not Sig", 
                                            paste("Padj <", 0.05), 
                                            paste("|FC| >", 1), 
                                            paste("Padj <", 0.05, 
                                                  "&\n|FC| >", 1)))

syn_mod_plots$All

Polar Plots

Next the coordinates and colours to map each variable, or row, in the expression/pvalues data onto a polar coordinate system is calculated using the polar_coords function:

syn_polar <- polar_coords(dep = syn_mod_p_obj, significance_cutoff = 0.01)

table(syn_polar@polar$sig) %>%
  kable(col.names = c("Significance", "Frequency")) %>%
  kable_styling(full_width = F)
Significance Frequency
Fibroid 15
Fibroid+Myeloid+ 43
Lymphoid 101
Lymphoid+Myeloid+ 65
Myeloid 9
Not Significant 112

These can be output to an interactive radar plot using radial_plotly. The labelRows variable allows any markers of interest to be labelled.

radial_plotly(polar = syn_polar,
              fc_cutoff = 0.3,
              label_rows = c("M156.0", "M37.2")) %>%
  layout(width = 650, height = 650)

Or a ggplot static image using radial_ggplot:

radial_ggplot(polar = syn_polar,
              fc_cutoff = 0.1,
              label_rows = c("M156.0", "M37.2"),
              marker_size = 2.7,
              label_size = 5,
              axis_lab_size = 3,
              axis_title_size = 5,
              legend_size = 10) +
  theme(legend.position = "right")

Boxplots

We can then interrogate any one specific variable as a boxplot, to investigate these differences.

plot1 <- boxplot_trio(syn_mod_p_obj,
                     value = "M156.0",
                     test = "wilcox.test",
                     levels_order = c("Lymphoid", "Myeloid", "Fibroid"),
                     box_colours = c("blue", "red", "green3"))

plot2 <- boxplot_trio(syn_mod_p_obj,
                     value = "M37.2",
                     test = "wilcox.test",
                     levels_order = c("Lymphoid", "Myeloid", "Fibroid"),
                     box_colours = c("blue", "red", "green3"))

ggarrange(plot1, plot2)


Citation

If you use this package please cite as:

Lewis, Myles J., et al. “Molecular portraits of early rheumatoid arthritis identify clinical and treatment response phenotypes.” Cell reports 28.9 (2019): 2455-2470.

or using:

citation("volcano3D")
## 
## To cite package 'volcano3D' in publications use:
## 
##   Katriona Goldmann and Myles Lewis (2020). volcano3D: Interactive
##   Plotting of Three-Way Differential Expression Analysis. R package
##   version 0.1.0.9000. https://github.com/KatrionaGoldmann/volcano3D
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {volcano3D: Interactive Plotting of Three-Way Differential Expression
## Analysis},
##     author = {Katriona Goldmann and Myles Lewis},
##     year = {2020},
##     note = {R package version 0.1.0.9000},
##     url = {https://github.com/KatrionaGoldmann/volcano3D},
##   }